Abstract

Covariances between categorical variables are defined using a regular simplex
expression for categories. The method follows the variance definition by Gini,
and it gives the covariance as a solution of simultaneous equations. The
calculated results give reasonable values for test data. A method of principal
component analysis (RS-PCA) is also proposed using regular simplex expressions,
which allows easy interpretation of the principal components. The proposed
methods are applied to a variable selection problem for the categorical
USCensus1990 data. The proposed methods give an appropriate criterion for the
variable selection problem of categorical data.



1 Introduction 

There are large collections of categorical data in many applications, such as 
information retrieval, web browsing, telecommunications, and market basket 
analysis. While the dimensionality of such data sets can be large, the variables 
(or attributes) are seldom completely independent. Rather, it is natural to 
assume that the attributes are organized into topics, which may overlap, i.e., 
collections of variables whose occurrences are somehow correlated to each other. 

One method to find such relationships is to select appropriate variables and
to view the data using a method like Principal Component Analysis (PCA) [1].
This approach gives us a clear picture of the data using the KL-plot of the PCA.
However, the method is not settled for data that include categorical variables.
Multinomial PCA [2] is analogous to PCA for handling discrete or categorical
data. However, Multinomial PCA is based on a parametric model, and it is
difficult to construct a KL-plot for the estimated result. Multiple
Correspondence Analysis (MCA) [3] is analogous to PCA and can handle discrete
categorical data. MCA is also known as homogeneity analysis, dual scaling, or
reciprocal averaging. The basic premise of the technique is that complicated
multivariate data can be made more accessible by displaying their main
regularities and patterns as plots ("KL-plots"). MCA is not based on a
parametric model and can give a KL-plot for the estimated result. In order to
represent the structure of the data, we sometimes need to ignore meaningless
variables.
However, MCA does not give covariances or correlation coefficients between a
pair of categorical variables. It is difficult to obtain criteria for selecting
appropriate categorical variables using MCA.

Table 1: Fisher's data

x_eye \ x_hair    fair    red    medium    dark    black
blue               326     38       241     110        3
light              688    116       584     188        4
medium             343     84       909     412       26
dark                98     48       403     681       85

Symbolic Data Analysis [HI [S] is one of the methods to give multivariate 
descriptions of categorical data. However, we focus on a more intuitive method
that can give an understandable plot such as a KL-plot.

In this paper, we introduce the covariance between a pair of categorical 
variables using the regular simplex expression of categorical data. This can give 
a criterion for selecting appropriate categorical variables. We also propose a 
new PCA method for categorical data. 

2 Gini's Definition of Variance and its Extension

Let us consider the contingency table shown in Table 1, which is known as
Fisher's data [5] on the colors of the eyes and hair of the inhabitants of
Caithness, Scotland. The table represents the joint population distribution of
the categorical variable for eye color $x_{eye}$ and the categorical variable
for hair color $x_{hair}$:

$$ x_{hair} \in \{fair,\ red,\ medium,\ dark,\ black\}, \qquad x_{eye} \in \{blue,\ light,\ medium,\ dark\}. \tag{1} $$

Before defining the covariances among such categorical variables,
$\sigma_{hair,eye}$, let us consider the variance of a categorical variable.
Gini successfully defined the variance for categorical data [5].

$$ \sigma_{ii} = \frac{1}{2N^2} \sum_{a=1}^{N} \sum_{b=1}^{N} (x_{ia} - x_{ib})^2, \tag{2} $$

where $\sigma_{ii}$ is the variance of the i-th variable, $x_{ia}$ is the value
of $x_i$ for the a-th instance, and $N$ is the number of instances. The distance
of a categorical variable between instances is defined as $x_{ia} - x_{ib} = 0$
if their values are identical, and $x_{ia} - x_{ib} = 1$ otherwise. A simple
extension of this definition to the covariance $\sigma_{ij}$, obtained by
replacing $(x_{ia} - x_{ib})^2$ with $(x_{ia} - x_{ib})(x_{ja} - x_{jb})$, does
not give reasonable values for the covariance $\sigma_{ij}$ [10]. In order to
avoid this difficulty, we extended the definition based on scalar values,
$x_{ia} - x_{ib}$, to a new definition using a vector expression [10]. The
vector expression for a categorical variable with three categories
$x_i \in \{r_1^i, r_2^i, r_3^i\}$ was defined by placing these three categories
at the vertices of a regular triangle.
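As a small illustration of Gini's variance, the following sketch evaluates the double sum directly from category counts. It assumes the normalisation $1/(2N^2)$ and the 0/1 distance described above; the function names and the hard-coded marginal counts (row and column sums of Table 1) are introduced here for illustration only.

```python
def gini_variance(counts):
    """Gini variance of one categorical variable from its category counts.

    With distance 0 for equal values and 1 otherwise, the double sum over
    instances counts the ordered pairs whose categories differ, which is
    N^2 - sum_k n_k^2.
    """
    n = sum(counts)
    differing_pairs = n * n - sum(c * c for c in counts)
    return differing_pairs / (2 * n * n)


def gini_variance_pairwise(values):
    """Same quantity, computed literally as the double sum over instances."""
    n = len(values)
    differing_pairs = sum(1 for a in values for b in values if a != b)
    return differing_pairs / (2 * n * n)


# Marginal counts of Table 1 (sums over the other variable).
hair_counts = [1455, 286, 2137, 1391, 118]  # fair, red, medium, dark, black
eye_counts = [718, 1580, 1774, 1315]        # blue, light, medium, dark

print(gini_variance(hair_counts))
print(gini_variance(eye_counts))
```

With the table as laid out here, the sketch gives about 0.3499 for the five hair categories and about 0.3641 for the four eye categories. These agree numerically with the two variance values reported for Fisher's data, although which value belongs to eye and which to hair depends on the orientation of the table.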

A regular simplex can be used for a variable with four or more categories.
This is a straightforward extension of a regular triangle when the dimension of
the space is greater than two. For example, a regular simplex in
3-dimensional space is a regular tetrahedron. Using a regular simplex, we can
extend and generalize the definition of covariance as follows.

Definition 1. The covariance between a categorical variable
$x_i \in \{r_1^i, r_2^i, \ldots, r_{k_i}^i\}$ with $k_i$ categories and a
categorical variable $x_j \in \{r_1^j, r_2^j, \ldots, r_{k_j}^j\}$ with $k_j$
categories is defined as

$$ \sigma_{ij} = \max_{L^{ij}} \left( \frac{1}{2N^2} \sum_{a=1}^{N} \sum_{b=1}^{N} \left(v^{k_i}(x_{ia}) - v^{k_i}(x_{ib})\right)^t L^{ij} \left(v^{k_j}(x_{ja}) - v^{k_j}(x_{jb})\right) \right), \tag{3} $$

where $v^{n}(r_k)$ is the position of the k-th vertex of a regular
$(n-1)$-simplex [1]. $r_k^i$ denotes the k-th element of the i-th categorical
variable $x_i$. $L^{ij}$ is a unitary matrix expressing the rotation between the
regular simplexes for $x_i$ and $x_j$.
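The vertex positions used in Definition 1 can be generated numerically. The sketch below is one possible construction, under two assumptions that are not stated explicitly in the text: the simplex has unit edge length (so that two different categories are at distance 1, matching Gini's 0/1 distance) and its centroid is at the origin. The function name `simplex_vertices` is hypothetical.

```python
import numpy as np


def simplex_vertices(k):
    """Rows are the k vertices of a regular (k-1)-simplex in R^(k-1).

    Assumed conventions: unit edge length and centroid at the origin.
    """
    # k scaled unit vectors in R^k are mutually at distance 1.
    points = np.eye(k) / np.sqrt(2.0)
    centered = points - points.mean(axis=0)
    # The centered points span a (k-1)-dimensional subspace; express them in
    # an orthonormal basis of that subspace, which keeps all distances.
    u, s, _ = np.linalg.svd(centered)
    return u[:, : k - 1] * s[: k - 1]


triangle = simplex_vertices(3)     # three categories: regular triangle
tetrahedron = simplex_vertices(4)  # four categories: regular tetrahedron
print(triangle)
```

Any rotation of these coordinates is an equally valid choice, which is why the definition maximises over the matrix $L^{ij}$.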

Definition 1 includes a procedure to maximize the covariance. Using Lagrange
multipliers, this procedure can be converted into a simpler problem of
simultaneous equations, which can be solved using the Newton method. The
following theorem enables this problem transformation.

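Theorem 2. The covariance between a categorical variable $x_i$ with $k_i$
categories and a categorical variable $x_j$ with $k_j$ categories is expressed by

$$ \sigma_{ij} = \mathrm{trace}\left(A^{ij} L^{ij\,t}\right), \tag{4} $$

where $A^{ij}$ is the $(k_i - 1) \times (k_j - 1)$ matrix

$$ A^{ij} = \frac{1}{2N^2} \sum_{a} \sum_{b} \left(v^{k_i}(x_{ia}) - v^{k_i}(x_{ib})\right) \left(v^{k_j}(x_{ja}) - v^{k_j}(x_{jb})\right)^t, \tag{5} $$

and $L^{ij}$ is given by the solution of the following simultaneous equations:

$$ A^{ij} L^{ij\,t} = \left(A^{ij} L^{ij\,t}\right)^t, \qquad L^{ij} L^{ij\,t} = E. \tag{6} $$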

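Proof. Here, we consider the case where $k_i = k_j$ for the sake of simplicity.
Definition 1 gives a conditional maximization problem:

$$ \max_{L^{ij}} \frac{1}{2N^2} \sum_{a} \sum_{b} \left(v^{k_i}(x_{ia}) - v^{k_i}(x_{ib})\right)^t L^{ij} \left(v^{k_j}(x_{ja}) - v^{k_j}(x_{jb})\right) \quad \text{subject to} \quad L^{ij} L^{ij\,t} = E. \tag{7} $$

The introduction of Lagrange multipliers $\Lambda$ for the constraint
$L^{ij} L^{ij\,t} = E$ gives the Lagrangian function:

$$ V = \mathrm{trace}\left(A^{ij} L^{ij\,t}\right) - \mathrm{trace}\left(\Lambda^{t} \left(L^{ij} L^{ij\,t} - E\right)\right), $$

where $\Lambda$ is a $k_i \times k_i$ matrix. A stationary point of the
Lagrangian function $V$ is a solution of the simultaneous equations (6). □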
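The step from the Lagrangian to the simultaneous equations can be written out as follows. This is a sketch that assumes the constraint is taken in the form $L^{ij} L^{ij\,t} = E$ and uses the standard matrix derivatives of the trace; $\Lambda$ denotes the multiplier matrix.

```latex
\begin{align*}
\frac{\partial V}{\partial L^{ij}}
  &= A^{ij} - (\Lambda + \Lambda^{t})\, L^{ij} = 0
  && \text{(stationarity in } L^{ij}\text{)} \\
A^{ij} L^{ij\,t}
  &= (\Lambda + \Lambda^{t})\, L^{ij} L^{ij\,t} = \Lambda + \Lambda^{t}
  && \text{(using } L^{ij} L^{ij\,t} = E\text{)} \\
\left(A^{ij} L^{ij\,t}\right)^{t}
  &= A^{ij} L^{ij\,t}
  && \text{(the right-hand side is symmetric)}
\end{align*}
```

At a stationary point the product $A^{ij} L^{ij\,t}$ is therefore symmetric while $L^{ij}$ remains orthogonal, and the multipliers $\Lambda$ drop out of the conditions to be solved.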

Instead of maximizing $\mathrm{trace}(A^{ij} L^{ij\,t})$ under the constraint
$L^{ij} L^{ij\,t} = E$, we can get the covariance by solving the equations (6),
which can be solved easily using the Newton method. A more efficient way to
compute the covariance is the following singular value decomposition of the
matrix $A^{ij}$.

Theorem 3. Let the singular value decomposition of the matrix $A^{ij}$ be

$$ A^{ij} = U D V^{t}. $$

The solution of the maximization problem (7) is given by

$$ L^{ij} = U V^{t}, \qquad \sigma_{ij} = \mathrm{trace}(D). $$

Application of this method to Table 1 gives

$$ \sigma_{hair,hair} = 0.36409, \qquad \sigma_{eye,hair} = 0.081253, \qquad \sigma_{eye,eye} = 0.34985. \tag{8} $$

We can derive a correlation coefficient using the covariance and variance values
of categorical variables in the usual way. The correlation coefficient between
$x_{eye}$ and $x_{hair}$ for Table 1 is 0.2277.
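The computation described by Theorem 3 can be sketched with NumPy as follows. The sketch makes several assumptions that are not fixed by the text: unit-edge, origin-centred simplex vertices, the normalisation $1/(2N^2)$ in $A^{ij}$, and the reduced SVD for the non-square case. The helper names (`simplex_vertices`, `cross_matrix`, `rs_covariance`) are hypothetical.

```python
import numpy as np


def simplex_vertices(k):
    """k vertices of a regular (k-1)-simplex (assumed: unit edges, centered)."""
    points = np.eye(k) / np.sqrt(2.0)
    centered = points - points.mean(axis=0)
    u, s, _ = np.linalg.svd(centered)
    return u[:, : k - 1] * s[: k - 1]


def cross_matrix(counts):
    """Matrix A^{ij} from a k_i x k_j contingency table of joint counts."""
    joint = np.asarray(counts, dtype=float)
    joint = joint / joint.sum()
    p_i = joint.sum(axis=1)
    p_j = joint.sum(axis=0)
    v_i = simplex_vertices(joint.shape[0])
    v_j = simplex_vertices(joint.shape[1])
    # The double sum over instance pairs reduces to the cross-covariance of
    # the vertex vectors: E[u w^t] - E[u] E[w]^t.
    return v_i.T @ (joint - np.outer(p_i, p_j)) @ v_j


def rs_covariance(counts):
    """Return (sigma_ij, L^{ij}) from the singular value decomposition of A^{ij}."""
    a = cross_matrix(counts)
    u, d, vt = np.linalg.svd(a, full_matrices=False)
    return d.sum(), u @ vt


# Table 1: rows are eye colors, columns are hair colors.
fisher = np.array([
    [326, 38, 241, 110, 3],
    [688, 116, 584, 188, 4],
    [343, 84, 909, 412, 26],
    [98, 48, 403, 681, 85],
])

cov_eye_hair, rotation = rs_covariance(fisher)
# A variable paired with itself has a diagonal contingency table.
var_eye, _ = rs_covariance(np.diag(fisher.sum(axis=1)))
var_hair, _ = rs_covariance(np.diag(fisher.sum(axis=0)))
correlation = cov_eye_hair / np.sqrt(var_eye * var_hair)
print(cov_eye_hair, var_eye, var_hair, correlation)
```

If these conventions match the authors', the printed covariance and correlation should land close to the values quoted for Fisher's data. The variances are obtained as the special case $i = j$, where the sum of singular values reduces to Gini's variance.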



3 Principal Component Analysis 

3.1 Principal Component Analysis of Categorical Data using Regular Simplex
(RS-PCA)

Let us consider categorical variables $x_1, x_2, \ldots, x_J$. For the a-th
instance, $x_i$ takes the value $x_{ia}$. Here, we represent $x_{ia}$ by the
vector of vertex coordinates $v^{k_i}(x_{ia})$. Then, the values of all the
categorical variables $x_1, x_2, \ldots, x_J$ for the a-th instance can be
represented by the concatenation of the vertex coordinate vectors of all the
categorical variables:

$$ x(a) = \left(v^{k_1}(x_{1a}),\ v^{k_2}(x_{2a}),\ \ldots,\ v^{k_J}(x_{Ja})\right). \tag{9} $$



Let us call this concatenated vector the List of Regular Simplex Vertices (LRSV) . 
The covariance matrix of LRSV can be written as 



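$$ A = \frac{1}{N} \sum_{a=1}^{N} \left(x(a) - \bar{x}\right)\left(x(a) - \bar{x}\right)^t = \begin{pmatrix} A^{11} & A^{12} & \cdots & A^{1J} \\ A^{21} & A^{22} & \cdots & A^{2J} \\ \vdots & \vdots & \ddots & \vdots \\ A^{J1} & A^{J2} & \cdots & A^{JJ} \end{pmatrix}, \tag{10} $$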
where $\bar{x} = \frac{1}{N} \sum_{a=1}^{N} x(a)$ is the average of the LRSV.
Equation (10) shows the covariance matrix of LRSV. Since its eigenvalue
decomposition can be regarded as a kind of Principal Component Analysis (PCA)
on LRSV, we call it the Principal Component Analysis using the Regular Simplex
for categorical data (RS-PCA).
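A minimal sketch of RS-PCA along these lines is shown below, using Fisher's data expanded to one (eye, hair) pair per person. The simplex convention (unit edges, centred at the origin) and the helper names `simplex_vertices`, `lrsv`, and `rs_pca` are assumptions made for this illustration, not part of the original method description.

```python
import numpy as np


def simplex_vertices(k):
    """k vertices of a regular (k-1)-simplex (assumed: unit edges, centered)."""
    points = np.eye(k) / np.sqrt(2.0)
    centered = points - points.mean(axis=0)
    u, s, _ = np.linalg.svd(centered)
    return u[:, : k - 1] * s[: k - 1]


def lrsv(columns, n_categories):
    """Concatenate, per instance, the simplex vertices of all variables.

    columns: list of integer arrays (category codes 0..k-1), one per variable.
    """
    return np.hstack([simplex_vertices(k)[np.asarray(codes)]
                      for codes, k in zip(columns, n_categories)])


def rs_pca(columns, n_categories):
    """Eigen-decomposition of the LRSV covariance matrix."""
    x = lrsv(columns, n_categories)
    centered = x - x.mean(axis=0)
    cov = centered.T @ centered / len(x)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]        # reorder: largest first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = centered @ eigvecs              # coordinates for a KL-plot
    return eigvals, eigvecs, scores


# Table 1: rows are eye colors, columns are hair colors.
counts = np.array([
    [326, 38, 241, 110, 3],
    [688, 116, 584, 188, 4],
    [343, 84, 909, 412, 26],
    [98, 48, 403, 681, 85],
])
eye_idx, hair_idx = np.indices(counts.shape)
eye = np.repeat(eye_idx.ravel(), counts.ravel())
hair = np.repeat(hair_idx.ravel(), counts.ravel())

eigvals, eigvecs, scores = rs_pca([eye, hair], [4, 5])
print(eigvals)
```

The first two columns of `scores` give the coordinates that would be drawn in a KL-plot. The signs and, for repeated eigenvalues, the orientation of the eigenvectors are not unique, so individual coefficients can differ from a published decomposition by a sign or a rotation of the simplex coordinates.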

When we need to interpret an eigenvector from RS-PCA, it is useful to
express the eigenvector as a linear combination of the following vectors. The
first basis set, d, consists of vectors from one vertex to another vertex in the
regular simplex. The other basis set, c, consists of vectors from the center of
the regular simplex to one of the vertices.

$$ d^{k_i}(a \to b) = v^{k_i}(b) - v^{k_i}(a), \qquad a, b = 1, 2, \ldots, k_i \tag{11} $$

$$ c^{k_i}(a) = v^{k_i}(a) - \frac{1}{k_i} \sum_{b=1}^{k_i} v^{k_i}(b), \qquad a = 1, 2, \ldots, k_i \tag{12} $$
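The two basis sets can be generated directly from the vertex coordinates. The sketch below assumes unit-edge, origin-centred simplex vertices; `closest_candidate` is a hypothetical helper, introduced only to illustrate how one block of an eigenvector might be matched to the most similar d or c vector, and it is not the authors' procedure.

```python
import numpy as np


def simplex_vertices(k):
    """k vertices of a regular (k-1)-simplex (assumed: unit edges, centered)."""
    points = np.eye(k) / np.sqrt(2.0)
    centered = points - points.mean(axis=0)
    u, s, _ = np.linalg.svd(centered)
    return u[:, : k - 1] * s[: k - 1]


def d_vector(vertices, a, b):
    """d(a -> b): vector from vertex a to vertex b."""
    return vertices[b] - vertices[a]


def c_vector(vertices, a):
    """c(a): vector from the center of the simplex to vertex a."""
    return vertices[a] - vertices.mean(axis=0)


def candidates(vertices, labels):
    """All c(a) and d(a -> b) vectors of one variable, keyed by a readable name."""
    out = {}
    for a, name_a in enumerate(labels):
        out[f"c({name_a})"] = c_vector(vertices, a)
        for b, name_b in enumerate(labels):
            if a != b:
                out[f"d({name_a} -> {name_b})"] = d_vector(vertices, a, b)
    return out


def closest_candidate(block, vertices, labels):
    """Hypothetical helper: name of the candidate with the largest cosine to block."""
    def cosine(u, w):
        return float(u @ w / (np.linalg.norm(u) * np.linalg.norm(w)))
    cands = candidates(vertices, labels)
    return max(cands, key=lambda name: cosine(block, cands[name]))


eye_labels = ["blue", "light", "medium", "dark"]
eye_vertices = simplex_vertices(len(eye_labels))
print(closest_candidate(eye_vertices[1] - eye_vertices[2], eye_vertices, eye_labels))
```

In practice the block passed to `closest_candidate` would be the part of an RS-PCA eigenvector that belongs to one variable; a full decomposition into several terms would need a further fitting step that is not shown here.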

Eigenvectors expressed in this way change their basis set depending on their
direction relative to the regular simplex, but this has the advantage of
allowing us to grasp their meaning easily. For example, the first two principal
component vectors from the data in Table 1 are expressed using the following
linear combinations.

$$ \begin{gathered} -0.63 \cdot d^{eye}(medium \to light) - 0.09 \cdot c^{eye}(blue) - 0.03 \cdot c^{eye}(dark) \\ 0.76 \cdot d^{hair}(medium \to fair) + 0.07 \cdot d^{hair}(dark \to medium) \end{gathered} \tag{13} $$

$$ \begin{gathered} 0.64 \cdot d^{eye}(dark \to light) - 0.13 \cdot d^{eye}(medium \to light) \\ 0.68 \cdot d^{hair}(dark \to medium) + 0.30 \cdot c^{hair}(fair) \end{gathered} \tag{14} $$

This expression shows that the axis is mostly characterized by the difference
between the $x^{eye} = light$ and $x^{eye} = medium$ values, and the difference
between the $x^{hair} = medium$ and $x^{hair} = fair$ values. The KL-plot using
these components is shown in Figure 1 for Fisher's data. In this figure, the
lower side is mainly occupied by data with values $x^{eye} = medium$ or
$x^{hair} = medium$. The upper side is mainly occupied by data with values
$x^{eye} = light$ or $x^{hair} = fair$. Therefore, we can confirm that
$(d^{eye}(medium \to light) + d^{hair}(medium \to fair))$ is the first principal
component. In this way, we can easily interpret the data distribution on the
KL-plot when we use the RS-PCA method.

Multiple Correspondence Analysis (MCA) [7] provides a PCA methodology similar
to that of RS-PCA. It uses the representation of categorical values as an
indicator matrix (also known as a dummy matrix). MCA gives a similar KL-plot.
However, the explanation of its principal components is difficult, because
their basis vectors contain one redundant dimension compared to the regular
simplex expression. Therefore, a conclusion from MCA can only be drawn after
making a great effort to inspect the KL-plot of the data.

Figure 1: KL-plot of Fisher's data calculated using RS-PCA. A point is
expressed by a pair of eye and hair categories: $x^{eye} - x^{hair}$.

4 Experimental Results

We evaluated the performance of our algorithms on the 1990 US census dataset
(http://kdd.ics.uci.edu/databases/census1990/USCensus1990.html). The 1990 US
census dataset is a multivariate categorical dataset that describes US census
data. The data set includes 68 discretized attributes such as age, income,
occupation, and work status. In this experiment, we ignored the categorical
variable "iOthrserv", since this variable has the same value for almost all
entries. We randomly selected 3k entries from the 2.5M available entries in the
entire data set, and applied our method to the remaining 67 discretized
attributes.

Table 2: Covariance of USCensus1990

Table 2 and Table 3 show the covariances and correlation coefficients,
respectively, among some categorical variables of the 1990 US census dataset,
computed with the covariance defined in Section 2. Figure 2 shows the
eigenvalues of the covariance matrix for the 67 categorical variables versus
mode number. In this figure, only the top 20 eigenvalues have large values.
This means that about 20 categorical variables are sufficient to explain the
1990 US census dataset.
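Reading a cutoff off an eigenvalue spectrum can also be done numerically. The following sketch is not taken from the paper, which judges the cutoff from Figure 2; it assumes a cumulative-share criterion with a hypothetical threshold `ratio`, and the spectrum used in the example is made up.

```python
def components_needed(eigenvalues, ratio=0.9):
    """Smallest number of leading eigenvalues whose sum reaches `ratio` of the total.

    The cumulative-share criterion and the default ratio are assumptions.
    """
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for count, value in enumerate(ordered, start=1):
        running += value
        if running >= ratio * total:
            return count
    return len(ordered)


# Made-up spectrum: three large modes followed by a small tail.
toy_spectrum = [0.30, 0.25, 0.20, 0.02, 0.01, 0.01]
print(components_needed(toy_spectrum, ratio=0.9))
```

Applied to the eigenvalues of an RS-PCA covariance matrix, the returned count plays the role of the visually chosen number of dominant modes.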

Figure 3 is a KL-plot of the categorical variables. In this figure, we can see
that the categorical variable iRlabor corresponds to the 1st principal component
and iSex corresponds to the 2nd principal component, since these variables lie
close to the corresponding axes.

In the following, the results of RS-PCA are compared with a focus on these two
categorical variables, iRlabor and iSex. Figure 4 plots the result of RS-PCA
using all 67 variables. Figure 5 is the RS-PCA result using the top 20
variables: iRlabor, iSex, and so on. Almost the same structure as in the result
using all variables appears in this figure. Figure 6 is the RS-PCA result using
the remaining 37th to 67th principal components. In this figure, we cannot find
a similar structure. Figure 7 is the RS-PCA result using the top 5 principal
components. We can find a structure similar to the result using all variables.
These results indicate that the abstracted data structure can be described by
only 5 variables.

The above results show that our method can be used for variable selection of
categorical data.

Figure 2: Eigenvalue vs. mode number

Table 3: Correlation coefficients of USCensus1990

Figure 3: KL-plot of USCensus1990 using all variables

Figure 4: RS-PCA of USCensus1990

Figure 5: RS-PCA using top 20 principal components

Figure 6: RS-PCA using 37-67th principal components

Figure 7: RS-PCA using top 5 principal components

5 Conclusion

We studied the covariances between a pair of categorical variables based on
Gini's definition of the variance for categorical data. The introduction of the
regular simplex expression for categorical values enabled a reasonable
definition of covariances, and an algorithm for computing the covariance was
proposed. The regular simplex expression was also shown to be useful in PCA.
We showed these merits through numerical experiments using Fisher's data
and the USCensus1990 data. In these experiments, our method was applied to the
variable selection problem for categorical data. The experiments showed that
our method gives an appropriate criterion for variable selection.